[serve][llm] Add TP*PP spacing to port offset for multi-replica deployments #58073
Merged
kouroshHakha merged 6 commits into ray-project:master on Oct 31, 2025
Conversation
Multiplies replica_rank by tensor_parallel_size to prevent port collisions when scaling to 2+ replicas with TP≥2.

Problem: PR ray-project#57771 fixed inter-replica port collisions by using replica_rank instead of defaulting to 0. However, it didn't account for the port space needed by TP workers within each replica. vLLM workers add their tp_rank (0, 1, ..., tp_size-1) to the base port at bind time (vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:790). Without proper spacing, consecutive replicas have overlapping port ranges:

- Replica 0, TP worker 1: base + 0 + 1 = 50001
- Replica 1, TP worker 0: base + 1 + 0 = 50001 ← collision

Solution: Space replicas by tp_size ports to reserve room for all TP workers:

- Replica 0 uses ports: [base, base+1, ..., base+(tp_size-1)]
- Replica 1 uses ports: [base+tp_size, base+tp_size+1, ...]

Impact:

- Fixes port collisions when autoscaling to 2+ replicas with TP≥2
- Backward compatible: TP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single replica deployments unchanged: no other replica to collide with

Related: PR ray-project#57771, ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
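As an illustration of the arithmetic above, a minimal sketch assuming a base port of 50000 (the helper names are hypothetical, not this PR's code):

```python
# Hypothetical helpers illustrating the arithmetic above; not the PR's code.
def worker_port_old(base: int, replica_rank: int, tp_rank: int) -> int:
    # Before this PR: replicas are offset by their rank only, so adjacent
    # replicas' TP workers can land on the same port.
    return base + replica_rank + tp_rank

def worker_port_tp_spaced(base: int, replica_rank: int, tp_rank: int, tp_size: int) -> int:
    # This PR's first iteration: space replicas by tp_size ports.
    return base + replica_rank * tp_size + tp_rank

base = 50000
# Replica 0 / TP worker 1 and Replica 1 / TP worker 0 collide on 50001:
assert worker_port_old(base, 0, 1) == worker_port_old(base, 1, 0) == 50001
# With tp_size spacing (TP=2), the same two workers bind distinct ports:
assert worker_port_tp_spaced(base, 0, 1, tp_size=2) == 50001
assert worker_port_tp_spaced(base, 1, 0, tp_size=2) == 50002
```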
kouroshHakha (Contributor) requested changes on Oct 30, 2025 and left a comment:
This fix would still have a problem when we have TP2PP2, because it doesn't consider PP at all. You should use the generic num_devices API which already exists in llm_config --> engine_config.
```python
return rc.rank
# Multiply by tp_size to reserve ports for all TP workers
# Each TP worker will add its tp_rank (0, 1, ..., tp_size-1)
return rc.rank * tp_size
```
Contributor
You need to offset by tp * pp. Effectively, you should use llm_config.get_engine_config().num_devices.
The previous fix didn't quite get it right for the TP×PP scenario. Use llm_config.get_engine_config().num_devices instead of manually calculating tp_size, ensuring proper port spacing for both TP and PP workers. This fixes the case where PP workers also bind NIXL ports and need spacing in addition to TP workers.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
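A minimal sketch of what this follow-up commit does, under the assumption that the offset helper looks roughly like the diff above (`rc.rank` and `llm_config.get_engine_config().num_devices` come from this thread; the function name is hypothetical):

```python
# Hedged sketch of the num_devices-based offset; not the exact code in
# python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py.
def _replica_port_offset(rc, llm_config) -> int:
    # num_devices == tensor_parallel_size * pipeline_parallel_size, so each
    # replica reserves one port per TP/PP worker before the next replica's
    # range begins.
    num_devices = llm_config.get_engine_config().num_devices
    return rc.rank * num_devices
```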
kouroshHakha approved these changes on Oct 31, 2025
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request on Nov 8, 2025:
…yments (ray-project#58073) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request on Nov 17, 2025:
…yments (ray-project#58073) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request on Nov 19, 2025:
…yments (ray-project#58073) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request on Dec 7, 2025:
…yments (ray-project#58073) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Multiply `replica_rank` by `num_devices` (tp × pp) to prevent port collisions when scaling to 2+ replicas with TP≥2 or PP≥2.

Root Cause

PR #57771 fixed port collisions in `python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py` for TP/PP by using Ray Serve's `replica_rank` for port offsets instead of defaulting to 0. However, the implementation doesn't account for the port spacing needed when each replica spawns multiple workers, so it could still lead to overlap.

Main issue: Consecutive replicas get consecutive port offsets (0, 1, 2, ...), but each replica actually needs `num_devices` (tp × pp) consecutive ports for its workers. This causes port ranges to overlap between replicas.

Example: 2 replicas, TP=2 — with consecutive offsets, replica 0's TP worker 1 and replica 1's TP worker 0 both bind base + 1.

Example: 2 replicas, TP=2, PP=2 — each replica needs 4 ports, so replica 1's range [base+1, base+4] overlaps replica 0's range [base, base+3] on three ports.
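The two examples can be worked through with a short illustrative script (the base port of 50000 and the helper below are assumptions, not code from the PR):

```python
# Enumerate the ports each replica's workers bind under the old scheme,
# where replica i gets offset i and each worker adds its local rank.
def replica_ports(base: int, replica_offset: int, workers_per_replica: int) -> set:
    return {base + replica_offset + local_rank
            for local_rank in range(workers_per_replica)}

base = 50000

# 2 replicas, TP=2: two workers per replica, consecutive offsets 0 and 1.
print(replica_ports(base, 0, 2) & replica_ports(base, 1, 2))
# {50001} -> one colliding port

# 2 replicas, TP=2, PP=2: four workers per replica, offsets still 0 and 1.
print(replica_ports(base, 0, 4) & replica_ports(base, 1, 4))
# {50001, 50002, 50003} -> three colliding ports
```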
Solution:

Space replicas by `num_devices` (tp × pp) ports to reserve room for all workers:

- Replica 0 uses ports: [base, base+1, ..., base+(num_devices-1)]
- Replica 1 uses ports: [base+num_devices, base+num_devices+1, ...]
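As a sanity check on this spacing rule, an illustrative script (the base port and configurations are arbitrary assumptions) confirming the replicas' port ranges stay disjoint:

```python
# Check that spacing replicas by num_devices keeps their port ranges disjoint.
from itertools import combinations

def replica_port_range(base: int, replica_rank: int, num_devices: int) -> set:
    offset = replica_rank * num_devices
    return {base + offset + local_rank for local_rank in range(num_devices)}

base = 50000
for tp, pp, num_replicas in [(2, 1, 2), (2, 2, 3), (4, 2, 4)]:
    num_devices = tp * pp
    ranges = [replica_port_range(base, r, num_devices) for r in range(num_replicas)]
    for a, b in combinations(ranges, 2):
        assert not (a & b), f"overlap for TP={tp}, PP={pp}"
print("no overlapping ports across replicas")
```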
The fix uses `llm_config.get_engine_config().num_devices`, which correctly accounts for both TP and PP workers.

Impact:

- Fixes port collisions when autoscaling to 2+ replicas with TP≥2 or PP≥2
- Backward compatible: TP=1, PP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single-replica deployments unchanged: no other replica to collide with
Note (about Data Parallel)

DP deployments don't need this fix because vLLM already multiplies `data_parallel_rank` by `tp_size` for the offset internally. So for DP the spacing is automatic, but for `replica_rank` we do the offset multiplication ourselves, since vLLM doesn't know about Ray Serve's replica concept. The fix uses `num_devices` instead of just `tp_size` to ensure PP workers also get unique ports.

Related: PR #57771, #55775, #58072